Protein Catalog Curation

Author

Joe Boktor

Published

January 24, 2023

Data sources

Unified Human Gastrointestinal Genome (UHGG/UHGP)

Global genome compendium curated from human stool samples and strain isolates from 31 countries across six continents. This catalog includes 289,232 prokaryotic genomes clustered into 4,744 species representatives. These genomes encode >170 million protein sequences, collated in the Unified Human Gastrointestinal Protein (UHGP) catalog. The UHGP more than doubles the number of gut proteins in comparison to those present in the Integrated Gene Catalog. More than 70% of the UHGG species lack cultured representatives, and 40% of the UHGP lack functional annotations (Almeida et al. 2021).

Human Reference Gut Microbiome - Korean, India, Japanese data-sets (HRGM-KIJ)

An addition of under-represented Asian countries to the UHGG v1 , with 90, 110, and 645 human microbiome samples collected from Korea, India, and Japan, respectively. Genome assembly produced 29,082 prokaryotic genomes from 845 fecal metagenomes (Merrill et al., n.d.).

Early-Life Gut Genomes (ELGG)

ELGG contains 32,277 genomes representing 2,172 species from 6,122 fecal metagenomes collected from children under 3 years old spanning delivery mode, gestational age, feeding pattern, and geography. The ELGG substantially expanded the phylogenetic diversity by 38% over the isolate microbial genomes, and the genomic landscape of the early-life microbiome by increasing recruitment of metagenomic reads to 82.8%. More than 60% of the ELGG species lack an isolate representative. The conspecific genomes of the most abundant species from children differed in gene diversity and functions compared to adults. The ELGG genomes encode over 80 million protein sequences, forming the Early-Life Gut Proteins (ELGP) catalog with over four million protein clusters, 29.5% of which lacked functional annotations (Zeng et al. 2022).

Hadza

Ultra-deep metagenomic sequencing and strain cultivation on 351 fecal samples from the Hadza, hunter-gatherers in Tanzania, and comparative populations in Nepal and California. This data set contains 94,971 genomes of bacteria, archaea, bacteriophages, and eukaryotes, 43% of which are absent from existing unified datasets (Merrill et al., n.d.).

RUMMETA

A ruminant gastrointestinal microbial reference catalog with 154,335,274 non-redundant genes and 8,745 uncultured species from over 10,373 metagenome-assembled genomes (MAGs). This genome compendium was built from 370 gastrointestinal (GIT) content samples, spanning 10 different GIT regions sampled from seven ruminant species (dairy cattle, Bos taurus; water buffalo, Bubalus bubalis; yak, Bos grunniens; goat, Capra aegagrus; sheep, Ovis aries; roe deer, Capreolus pygargus; water deer, Hydropotes inermis) (Xie et al. 2021).

Rumen-Uncultured Genomes (RUGs)

5,588 rumen MAGs collected from 283 ruminant beef cattle and 24 rumen fluid samples from six Boran (Bos indicus), indigenous Zebu cattle from sub-Saharan Africa (Wilkinson et al. 2020; Stewart et al. 2019).

EuPathDB

EuPathDB includes 11 databases (AmoebaDB, CryptoDB, FungiDB, GiardiaDB, HostDB, MicrosporidiaDB, OrthoMCLDB, PiroplasmaDB, PlasmoDB, ToxoDB, TrichDB, TriTrypDB, and VectorBase) which together represent 689 eukaryotic microorganisms. This data includes both known pathogens and other closely related non-infectious eukaryotic species(Aurrecoechea et al. 2013).

Metagenomic Gut Virus catalog (MGV)

189,680 viral genomes from 11,810 publicly available human stool metagenomes with 11,837,198 non-redundant proteins identified. Over 75% of genomes represent double-stranded DNA phages that infect members of the Bacteroidia and Clostridia classes. Sequence clustering reveals 54,118 candidate viral species, 92% of which were not found in existing databases. The Metagenomic Gut Virus catalog improves detection of viruses in stool metagenomes and accounts for nearly 40% of CRISPR spacers found in human gut Bacteria and Archaea (Nayfach et al. 2021).

WormBase ParaSite (WBPS17)

Databases consisting of 210 genomes, representing 169 distinct species from the Nemotoda (Roundworm) and Plathelminthes (Flatworm) phyla (Harris et al. 2020).

Data sources summary

Catalog Release Domain(s) Samples Genomes Proteins (95%) DOI catalog

UHGG/UHGP

Mgnify Unified Human Gastrointestinal Genomes/Proteins v2.0.1

2022

Bacteria

Archaea

25,088 + cultured genome resources 289,232 18,482,368

link

link

link

HRGM-KIJ

Human Reference Gut Microbiome (Korea, India, Japan)

2021

Bacteria

Archaea

845 29,082 6,312,390 link link

ELGG/ELGP

Early-Life Gut Genomes/Proteins

2022

Bacteria

Archaea

32,277 6,122 3,879,904 link link

Hadza

African Hadza tribe and comparative Nepali, and California samples

2022

Bacteria

Archaea

Virus

Eukaryote

388 94,971 32,431,351 link link

RUMMETA

10 GIT regions of seven ruminant species

2021

Bacteria

Archaeal

370 10,373 109,586,913 link link

RUGs

MGnify Cow Rumen v1.0

2019

Bacteria

Archaeal

207 5,588 5,502,611

link

link

link

MGV

Metagenomic Gut Virus Catalog

2021 Virus _ 189,680 1,725,656 link link
WormBase ParaSite (WBPS17) 2019 Eukaryote _ 210 10,328,252 link link

EuPathDB

eukaryotic pathogen database (Release 61)

2022 Eukaryote _ 689 3,456,934 link link

Analysis Setup

Code
library(tidyverse)
library(reticulate)
library(magrittr)
library(glue)
library(seqinr)
library(future)
library(future.batchtools)
library(furrr)
library(fs)
library(tictoc)
library(listenv)
library(progress)
library(tibble)
library(protr)
library(kableExtra)
library(plotly)
library(ggsci)
library(data.table)
library(ggside)

tmpdir <- "/central/scratch/jbok/tmp"
homedir <- "/central/groups/MazmanianLab/joeB"
wkdir <- glue(
  "{homedir}/",
  "Microbiota-Immunomodulation/Microbiota-Immunomodulation"
)
src_dir <- glue("{wkdir}/notebooks")
source(glue("{src_dir}/R-scripts/helpers_general.R"))
source(glue("{src_dir}/R-scripts/helpers_sequence-screens.R"))
protein_catalogs <- glue("{homedir}/Downloads/protein_catalogs")

Visualizing catalog summary statistics

Code
catalog_stats <- tribble(
  ~catalog, ~Bacteria, ~Archaea, ~Virus, ~Eukaryote, ~Genomes, ~Samples, ~proteins95,
  "UHGP", 1, 1, 0, 0, 289232, 25088, 18482368,
  "KIJ", 1, 1, 0, 0, 29082, 845, 6312390,
  "ELGP", 1, 1, 0, 0, 6122, 32277, 3879904,
  "Hadza", 1, 1, 1, 1, 94971, 388, 32431351,
  "RUMMETA", 1, 1, 0, 0, 10373, 370, 109586913,
  "RUGs", 1, 1, 0, 0, 5588, 207, 5502611,
  "MGV", 0, 0, 1, 0, 189680, NA, 1725656,
  "WormBase", 0, 0, 0, 1, 210, NA, 10328252,
  "EuPathDB", 0, 0, 0, 1, 689, NA, 3456934
)

p_bar1 <- catalog_stats %>%
  ggplot(aes(y = reorder(catalog, proteins95), x = proteins95)) +
  geom_col(width = 0.6, fill = "lightblue") +
  labs(y = NULL, x = "No. of proteins in each catalog (95% similarity)") +
  theme_light() +
  scale_x_continuous(expand = c(0, 0)) +
  theme(panel.grid.major.y = element_blank())
p_bar2 <- catalog_stats %>%
  ggplot(aes(y = reorder(catalog, Genomes), x = Genomes)) +
  geom_col(width = 0.6, fill = "lightblue") +
  labs(y = NULL, x = "No. of Assembled Genomes") +
  theme_light() +
  scale_x_continuous(expand = c(0, 0)) +
  theme(panel.grid.major.y = element_blank())
p_bar3 <- catalog_stats %>%
  drop_na(Samples) %>%
  ggplot(aes(y = reorder(catalog, Samples), x = Samples)) +
  geom_col(width = 0.6, fill = "lightblue") +
  labs(y = NULL, x = "No. of Mammalian Samples") +
  theme_light() +
  scale_x_continuous(expand = c(0, 0)) +
  theme(panel.grid.major.y = element_blank())

patch1 <- (p_bar1  / p_bar2 / p_bar3) +
  plot_layout(heights = c(1, 1, 0.666))

ggsave(patch1, 
  filename = glue("{wkdir}/figures/protein_catalog/summary_stats_barplot.png"),
  width = 6, height = 8, dpi = 300)
Code
p_donut <- catalog_stats %>%
  summarise_at(vars(Bacteria, Archaea, Virus, Eukaryote), sum) %>%
  pivot_longer(cols = everything(), names_to = "category", values_to = "count") %>%
  mutate(fraction = count / sum(count)) %>%
  mutate(ymax = cumsum(fraction)) %>%
  mutate(ymin = c(0, head(ymax, n = -1))) %>%
  mutate(labelPosition = (ymax + ymin) / 2) %>%
  mutate(label = paste0(category, "\n catalogs: ", count)) %>%
  ggplot(aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_text( x=2, aes(y=labelPosition, label=label, color=category), size=5 ) +
  scale_fill_npg() +
  scale_color_npg() +
  coord_polar(theta="y") +
  xlim(c(-1, 4)) +
  theme_void() +
  theme(legend.position = "none")

ggsave(p_donut, 
  filename = glue("{wkdir}/figures/protein_catalog/domain_donut.png"),
  width = 8, height = 8, dpi = 300)

Representation of Domains

Catalog summary statistics

Downloading and Formatting Sequence Data

UHGP v 2.0

Code
uhgp_ftp_links <- 
  paste0(
    "http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/",
    "human-gut/v2.0.1/protein_catalogue/uhgp-", 
    c(100, 95, 90, 50), ".tar.gz"
  )

for (f in uhgp_ftp_links) {
  jname <- fs::path_ext_remove(basename(f))
  wget_download_slurm(
    jobname = jname,
    slurm_out = "/central/scratch/jbok/slurmdump/",
    download_link = f,
    output_dir = protein_catalogs
  )
}
Code
# Unzip uhgp-100.tar.gz
slurm_shell_do(
  cmd = glue("tar -xvzf {protein_catalogs}/UHGP_v2.0.1/uhgp-95.tar.gz",
    " -C {protein_catalogs}/UHGP_v2.0.1/"),
  memory = "200G",
  walltime = 144000
)

ELGG

Code
# Download 95% ELGG protein faa via https://zenodo.org/record/6969520/files/ELGP_95.faa.gz?download=1
# wget https://zenodo.org/record/6969520/files/ELGP_95.faa.gz 
wget_download_slurm(
  jobname = "ELGG",
  download_link = "https://zenodo.org/record/6969520/files/ELGP_95.faa.gz",
  slurm_out = "/central/scratch/jbok/slurmdump/",
  output_dir = protein_catalogs
  )
Code
# Unzip ELGP_95.faa.gz
slurm_shell_do(
  cmd = glue("gunzip {protein_catalogs}/ELGP_95.faa.gz"),
  memory = "50G",
  walltime = 36000
)
# shell_do("gunzip /central/groups/MazmanianLab/joeB/Downloads/protein_catalogs/ELGP_95.faa.gz")

KIJ

Code
# HRGM (KIJ 100 only) using 110 India, 645 Japan, and 90 Korean fecal samples
# https://www.mbiomenet.org/HRGM/data/protein_catalog/2.KIJ_CD-HIT-100_Proteins/KIJ_CD-HIT-100_Proteins.faa.gz
wget_download_slurm(
  jobname = "HRGM_KIJ100",
  download_link = "https://www.mbiomenet.org/HRGM/data/protein_catalog/2.KIJ_CD-HIT-100_Proteins/KIJ_CD-HIT-100_Proteins.faa.gz",
  slurm_out = "/central/scratch/jbok/slurmdump/",
  output_dir = protein_catalogs
)
Code
# Unzip
slurm_shell_do(
  cmd = glue("gunzip {protein_catalogs}/KIJ_CD-HIT-100_Proteins.faa.gz"),
  memory = "50G",
  walltime = 36000
)
# shell_do("gunzip /central/groups/MazmanianLab/joeB/Downloads/protein_catalogs/KIJ_CD-HIT-100_Proteins.faa.gz")

Hadza/Nepali Hunter-Gatherers

Code
# Downloading Hadza Mag metadata
shell_do(glue(
  "mkdir -p {wkdir}/data/input/protein_catalog_metadata"
))
shell_do(glue(
  "wget -P {wkdir}/data/input/protein_catalog_metadata/",
  " https://www.biorxiv.org/content/biorxiv/early/2022/10/07/2022.03.30.486478/DC2/embed/media-2.xlsx",
))
shell_do(glue(
  "mv {wkdir}/data/input/protein_catalog_metadata/media-2.xlsx",
  " {wkdir}/data/input/protein_catalog_metadata/Hadza-MAG-metadata.xlsx"
))
Code
hadza_meta_PRJEB49206 <-
  read.delim(
    glue(
      "{wkdir}/data/input/protein_catalog_metadata/",
      "filereport_analysis_PRJEB49206_tsv.txt"
    ),
    stringsAsFactors = FALSE, header = TRUE
  )

# get list of files that don't currently exist
hadza_files_unprocessed_lgr <-
  hadza_meta_PRJEB49206$submitted_ftp %>%
  purrr::map(~ !(
    file.exists(glue("{hadza_metagenomes_dir}/{basename(.)}")) |
      file.exists(glue(
        "{hadza_metagenomes_dir}/",
        "{fs::path_ext_remove(basename(.))}"
      ))
  ))

hadza_files_unprocessed <- 
  hadza_meta_PRJEB49206$submitted_ftp %>%
  keep(unlist(hadza_files_unprocessed_lgr))

# Initiate future.batchtools backend for parallel processing
future::plan(
  future.batchtools::batchtools_slurm,
  template = glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),
  resources = list(
    name = "hadza_agg-download",
    memory = "1G",
    ncpus = 2,
    walltime = 360000
  )
)

# Chunk files (100 per job) and download
tic()
n_jobs <- ceiling(length(hadza_files_unprocessed) / 100)
download_runs <- listenv()

for (job in 1:n_jobs) {
  input_chunk <- chunk_func(hadza_files_unprocessed, n_jobs)[[job]]
  download_runs[[job]] %<-% wget_path_list(
    download_list = input_chunk,
    download_dir = "/central/scratch/jbok/PRJEB49206_HADZA"
  )
}
toc()
# list.files("/central/scratch/jbok/PRJEB49206_HADZA") %>% length
Code
#______________________________________________________________________
# Prodigal on Hadza MAGs ----
# Initiate future.batchtools backend for parallel processing
future::plan(
  future.batchtools::batchtools_slurm,
  template = glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),
  resources = list(
    name = "prodigal-Hadza-MAGs",
    memory = "5G",
    ncpus = 1,
    walltime = 360000
  )
)

# select input genome fasta files paths for prodigal
hadza_proteins_dir <- glue(
  "{homedir}/Downloads/",
  "protein_catalogs/hadza_proteins"
)
hadza_mag_paths <-
  list.files(
    path = "/central/scratch/jbok/PRJEB49206_HADZA",
    full.names = TRUE
  )
hadza_mag_unprocessed_lgr <-
  hadza_mag_paths %>%
  purrr::map( ~ !file.exists(glue("{hadza_proteins_dir}/{basename(.)}")))

hadza_mag_paths_to_process <-
  hadza_mag_paths %>%
  keep(unlist(hadza_mag_unprocessed_lgr))

# generate list of desired fasta outputs
hadza_proteins_paths <-
  purrr::map(
    hadza_mag_paths_to_process,
    ~ glue("{hadza_proteins_dir}/{fs::path_ext_remove(basename(.))}")
  )

# Chunk files (500 per job) and download
tic()
n_jobs <- ceiling(length(hadza_mag_paths_to_process) / 25)
prodigal_runs <- listenv()

for (job in 1:n_jobs) {
  input_list <- chunk_func(hadza_mag_paths_to_process, n_jobs)[[job]]
  output_list <- chunk_func(hadza_proteins_paths, n_jobs)[[job]]
  prodigal_runs[[job]]  %<-% shell_do_prodigal_list(
    input_genome_list = input_list,
    output_fasta_list = output_list
  )
}
toc()

hadza_files <- list.files(
  glue("{protein_catalogs}/hadza_proteins"),
  full.names = TRUE
  )

hadza_files %>% length
Code
pb <- progress_bar$new(total = length(hadza_files),
  format = "[:bar] :current/:total (:percent)"
  )

for (f in hadza_files){
  pb$tick()
  shell_do(glue(
    "cat {f} >> {protein_catalogs}/hadza.fasta"
    ))
}

Ruminants - RUMMETA

Code
#' https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01078-x 
#' An integrated gene catalog and over 10,000 metagenome-assembled genomes 
#' from the gastrointestinal microbiome of ruminants
# http://rummeta.njau.edu.cn/rumment/download/downloadFile?file=RGMGC.geneSet.faa.gz

wget_download_slurm(
  jobname = "RUMMETA",
  download_link = "http://rummeta.njau.edu.cn/rumment/download/RGMGC.geneSet.faa.gz", 
  slurm_out = "/central/scratch/jbok/slurmdump/",
  output_dir = protein_catalogs
  )
Code
slurm_shell_do(
  cmd = glue("gunzip {protein_catalogs}/RGMGC.geneSet.faa.gz"),
  memory = "50G",
  walltime = 36000
)

Ruminants - RUGs

Code
rug_ftp_links <- 
  paste0(
    "http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/",
    "cow-rumen/v1.0/protein_catalogue/protein_catalogue-", 
    c(100, 95, 90, 50), ".tar.gz")

for (f in rug_ftp_links) {
  jname <- fs::path_ext_remove(basename(f))
  wget_download_slurm(
    jobname = jname,
    slurm_out = "/central/scratch/jbok/slurmdump/",
    download_link = f,
    output_dir = glue("{protein_catalogs}/cow-rumen-v1.0")
  )
}
Code
slurm_shell_do(
  cmd = glue("tar -xvzf {protein_catalogs}/cow-rumen-v1.0/protein_catalogue-95.tar.gz",
  " -C {protein_catalogs}/cow-rumen-v1.0/"),
  memory = "50G",
  walltime = 36000
)

MGV

Code
wget_download_slurm(
  jobname = "MGV_proteins",
  download_link = "https://portal.nersc.gov/MGV/MGV_v1.0_2021_07_08/mgv_proteins.faa", 
  slurm_out = "/central/scratch/jbok/slurmdump/",
  output_dir = protein_catalogs
  )

VEuPathDB

Code
protein_catalogs <- glue("{homedir}/", "Downloads/protein_catalogs/")
VEuPathDB_meta <-
  read.delim(glue("{wkdir}/data/input/protein_catalog_metadata/VEuPathDB_release-61_2022-12-15.txt"),
    stringsAsFactors = FALSE, header = TRUE
  )

# checking length of data
VEuPathDB_links <- VEuPathDB_meta$protein_fasta_link %>% unique()
VEuPathDB_links %>% length()

for (f in VEuPathDB_links) {
  if (!file.exists(glue("{protein_catalogs}/VEuPathDB_v61/{basename(f)}"))){
    jname <- fs::path_ext_remove(basename(f))
    wget_download_slurm(
      jobname = jname,
      slurm_out = "/central/scratch/jbok/slurmdump/",
      download_link = f,
      output_dir = glue("{protein_catalogs}/VEuPathDB_v61")
    )
  Sys.sleep(5)
  }
}
Code
# Some runs failed, collect and examine filepaths
download_failed <- list()
for (f in VEuPathDB_links) {
  if (!file.exists(glue("{protein_catalogs}/VEuPathDB_v61/{basename(f)}"))){
    jname <- fs::path_ext_remove(basename(f))
    download_failed[[jname]] <- f
  }
}

# rerun download failures, downloading genomes (nt) instead of proteins (aa)
download_failed %<>% purrr::map(
  ~ gsub("AnnotatedProteins", "Genome", .)
)
for (f in download_failed) {
  if (!file.exists(glue("{protein_catalogs}/VEuPathDB_v61/{basename(f)}"))){
    jname <- fs::path_ext_remove(basename(f))
    wget_download_slurm(
      jobname = jname,
      slurm_out = "/central/scratch/jbok/slurmdump/",
      download_link = f,
      output_dir = glue("{protein_catalogs}/VEuPathDB_v61/Genomes/")
    )
  Sys.sleep(1)
  }
}

# Initiate future.batchtools backend for parallel processing
future::plan(
  future.batchtools::batchtools_slurm,
  template = glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),
  resources = list(
    name = "prodigal-VEuPathDB-Genomes",
    memory = "5G",
    ncpus = 1,
    walltime = 36000
  )
)

VEuPathDB_genome_paths <-
  list.files(
    glue("{homedir}/Downloads/protein_catalogs/VEuPathDB_v61/Genomes"),
    full.names = TRUE
  )

for (genome_path in VEuPathDB_genome_paths) {
  output_file_name <- gsub(
    "Genome", "AnnotatedProteins", basename(genome_path)
  )
  output_file <-
    glue("{homedir}/Downloads/protein_catalogs/VEuPathDB_v61/{output_file_name}")
  fout %<-% {
    shell_do_prodigal(
      input_genome = genome_path,
      output_fasta = output_file
    )
  }
}
Code
# concatenate individual fasta files
slurm_shell_do(
  cmd = glue(
    "cat {protein_catalogs}/VEuPathDB_v61/*.f* >",
    " {protein_catalogs}/VEuPathDB_v61.fasta"
  ),
  memory = "10G"
)

WormBase

Code
wormbase_meta <-
  read.delim(
    glue(
      "{wkdir}/data/input/protein_catalog_metadata/",
      "2023-02-09_WormBase-genomes.txt"
    ),
    stringsAsFactors = FALSE, header = TRUE
  )

for (f in wormbase_meta$Proteins) {
  jname <- fs::path_ext_remove(basename(f))
  if (!file.exists(glue("{protein_catalogs}/wormbase-v17.0/{basename(f)}"))){
    wget_download_slurm(
      jobname = jname,
      slurm_out = "/central/scratch/jbok/slurmdump/",
      download_link = f,
      output_dir =  glue("{protein_catalogs}/wormbase-v17.0")
    )
    Sys.sleep(0.5)
  }
}
Code
slurm_shell_do(
  cmd = glue("gunzip {protein_catalogs}/wormbase-v17.0/*.gz"),
  memory = "5G",
  walltime = 360000
)
Code
# concatenate individual fasta files
slurm_shell_do(
  cmd = glue(
    "cat {protein_catalogs}/wormbase-v17.0/* >",
    " {protein_catalogs}/wormbase-v17.0.fasta"
  ),
  memory = "10G"
)

Catalog Assembly

MMSeq2 95% similarity clustering of each catalog independently to remove redundant sequences

Code
# Utility function for running MMSeq2 clustering on the HPC
slurm_shell_do_mmseq2_clust <- function(input_fasta,
                                        output,
                                        id_threshold = 0.95,
                                        jobname = glue("MMSeq2-{rand_string()}"),
                                        memory = "120G", # per cpu
                                        ncpus = 12,
                                        walltime = 36000,
                                        tmpdir = "/central/scratch/jbok/tmp",
                                        working_dir = wkdir) {
  source(glue("{working_dir}/notebooks/R-scripts/helpers_general.R"))
  mmseq_cmd <- glue(
    "mmseqs easy-linclust --cov-mode 1 -c 0.8",
    " --kmer-per-seq 80",
    " --min-seq-id {id_threshold}",
    " --threads {ncpus}",
    " {input_fasta}",
    " {output}",
    " {tmpdir}"
  )
  slurm_shell_do(
    template_path = glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),
    cmd = mmseq_cmd,
    jobname = jobname,
    memory = memory,
    ncpus = ncpus,
    walltime = walltime,
    working_dir = working_dir
  )
}
Code
catalog_paths <- list(
  "UHGP" = glue("{protein_catalogs}/UHGP_v2.0.1/uhgp-95/uhgp-95.faa"),
  "ELGG" = glue("{protein_catalogs}/ELGP_95.faa"),
  "KIJ" = glue("{protein_catalogs}/KIJ_CD-HIT-100_Proteins.faa"),
  "Hadza" = glue("{protein_catalogs}/hadza.fasta"),
  "RUMMETA" = glue("{protein_catalogs}/RGMGC.geneSet.faa"),
  "RUGS" = glue("{protein_catalogs}/cow-rumen-v1.0/protein_catalogue-95/protein_catalogue-95.faa"),
  "MGV" = glue("{protein_catalogs}/mgv_proteins.faa"),
  "VEuPathDB" = glue("{protein_catalogs}/VEuPathDB_v61.fasta"),
  "Wormbase" = glue("{protein_catalogs}/wormbase-v17.0.fasta")
)
Code
for (catalog in names(catalog_paths)) {
  print(catalog)
  output_dir <- glue("{protein_catalogs}/clustered_catalogs/{catalog}")
  shell_do(glue("mkdir -p {output_dir}"))
  if (file.exists(
    glue("{output_dir}/{catalog}-MMSeq2-95_rep_seq.fasta")
  )) {
    message(catalog, " file already exists ...")
    next
  }
  slurm_shell_do_mmseq2_clust(
    input_fasta = catalog_paths[[catalog]],
    output = glue("{output_dir}/{catalog}-MMSeq2-95")
  )
}

Adding a unique catalog ID to the headers of all FASTAs

Code
clustered_rep_paths <-
  names(catalog_paths) %>%
  purrr::set_names() %>%
  purrr::map(
    ~ glue(
      "{protein_catalogs}/clustered_catalogs/{.}/",
      "{.}-MMSeq2-95_rep_seq.fasta"
    ))

# check that files exist
if (!all(purrr::map(clustered_rep_paths, ~ file.exists(.)) %>% unlist )) stop()

clustered_named_rep_paths <- list()
for (catalog in names(clustered_rep_paths)) {
  clustered_named_rep_paths[[catalog]] <- glue(
      "{protein_catalogs}/clustered_catalogs/{catalog}/",
      "{catalog}-MMSeq2-95_rep_seq_renamed.fasta"
    )
  if (file.exists(clustered_named_rep_paths[[catalog]])) {
    message(catalog, " file already exists ...")
    next
  }
  message("Annotating: ", catalog)
  slurm_shell_do(
    glue(
      "/home/jboktor/bbmap/bbrename.sh",
      " in={clustered_rep_paths[[catalog]]}",
      " out={clustered_named_rep_paths[[catalog]]}",
      " prefix=CATID_{catalog}",
      " addprefix=t",
      " fixjunk=t",
      " -Xmx100g"
    ),
    memory = "100G", 
    walltime = 36000
  )
  Sys.sleep(2)
}

Collecting summary metrics for each clustered catalog

Code
clust_catalog_protein_n <- list()
for (catalog in names(clustered_rep_paths)) {
  message("Counting proteins in: ", catalog)
  clust_catalog_protein_n[[catalog]] <- 
    shell_do(
        glue("grep -c '^>' {clustered_rep_paths[[catalog]]}"),
        stdout_path = TRUE
    )  %>% 
    as.numeric
}

Concatenating clustered FASTAs

Code
shell_do(glue("mkdir -p {protein_catalogs}/clustered_catalogs/merged"))

# check that files exist
if (!all(purrr::map(
  clustered_named_rep_paths, ~ file.exists(.)
  ) %>% unlist )) stop()

for (catalog_path in clustered_named_rep_paths) {
  message("Merging", catalog_path)
  # pause execution until previous job is complete
  check_slurm_overload(njobs = 1, interval = 5)
  slurm_shell_do(glue(
    "cat {catalog_path} >> ",
    "{protein_catalogs}/clustered_catalogs/merged/",
    "2023-02-13_concatenated-catalog.fasta"
  ), memory = "50G")
}

MMSeq2 similarity clustering of concatenated catalogs

Code
# Clustering at 95% (kmer for seq 200)
slurm_shell_do_mmseq2_clust(
  input_fasta = glue(
    "{protein_catalogs}/clustered_catalogs/",
    "merged/2023-02-13_concatenated-catalog.fasta"
  ),
  output = glue(
    "{protein_catalogs}/clustered_catalogs/",
    "merged/2023-02-13_catalog_MMSeq2-95"
  ),
  memory = "128G",
  ncpus = 8
)

# 90% Identity (kmer for seq 200)
slurm_shell_do_mmseq2_clust(
  input_fasta = glue(
    "{protein_catalogs}/clustered_catalogs/",
    "merged/2023-02-13_concatenated-catalog.fasta"
  ),
  output = glue(
    "{protein_catalogs}/clustered_catalogs/",
    "merged/2023-02-13_catalog_MMSeq2-90"
  ),
  id_threshold = 0.9,
  memory = "10G",
  ncpus = 8
)

# 50% Identity (kmer for seq 80)
slurm_shell_do_mmseq2_clust(
  input_fasta = glue(
    "{protein_catalogs}/clustered_catalogs/",
    "merged/2023-02-13_concatenated-catalog.fasta"
  ),
  output = glue(
    "{protein_catalogs}/clustered_catalogs/",
    "merged/2023-02-13_catalog_MMSeq2-50"
  ),
  walltime = 270000,
  id_threshold = 0.5,
  memory = "250G",
  ncpus = 4
)

Chunking protein catalog into 50 bins

Code
#' function to divide original fasta into N smaller fasta files, roughly even in size
#' using bash awk
get_awk_chunk_cmd <- function(fasta_path, nbins, output_dir){
  filename <- fs::path_ext_remove(basename(fasta_path))
  protein_n <- shell_do(glue("grep -c '^>' {fasta_path}"), 
    stdout_path = TRUE
    ) %>% 
    as.numeric()
  proteins_per_file <- ceiling(protein_n/nbins)
  chunk_cmd <- glue(
    "awk 'BEGIN {n=0;} /^>/ ",
    "{if(n%{{proteins_per_file}==0)",
    "{file=sprintf(\"{{output_dir}/{{filename}_chunk_%d.fasta\",n);} ",
    "print >> file; n++; next;} { print >> file; }' ",
    "< {{fasta_path}", 
     .open = "{{")
  return(chunk_cmd)
}
Code
slurm_shell_do(
  cmd = get_awk_chunk_cmd(
    fasta_path = 
      glue("{protein_catalogs}/clustered_catalogs/merged/",
        "2023-02-13_catalog_MMSeq2-95_rep_seq.fasta"),
    output_dir = 
      glue("{protein_catalogs}/clustered_catalogs/merged/chunked_fasta"),
    nbins = 50
    ),
    jobname = glue("chunk-catalog-{rand_string()}"), working_dir = wkdir, 
    memory = "100G", 
    ncpus = 1, 
    walltime = 360000
)

Calculating the number of unique proteins in the curated catalog at each clustering threshold

Code
catalog_clustered_rep_paths <- list.files(
  glue("{protein_catalogs}/clustered_catalogs/merged"), 
  pattern = "rep_seq.fasta",
  full.names = TRUE
)
repseq_protein_n <- list()
for (catalog in catalog_clustered_rep_paths) {
  message("Counting proteins in: \n", catalog)
  repseq_protein_n[[catalog]] <- 
    shell_do(
        glue("grep -c '^>' {catalog}"),
        stdout_path = TRUE
    )  %>% 
    as.numeric
    repseq_protein_n[[catalog]] %>% print()
}
Code
catalog_n_data <-  tribble(
  ~`Clustering Threshold`, ~`# of Proteins`,
  "50%", 58025475,
  "90%", 133902480,
  "95%", 164949326,
)

catalog_n_data %>%
    kable(format.args = list(big.mark = ",")) %>% 
    kable_styling(
      font_size = 15,
      bootstrap_options = c("striped", "hover", "condensed"),
      full_width = F
      )
Clustering Threshold # of Proteins
50% 58,025,475
90% 133,902,480
95% 164,949,326

Gathering representative sequences from 50% clustered catalog

Code
shell_do(
  glue(
    "cd {protein_catalogs}/clustered_catalogs/merged &&"
    " grep -o '^>.*' 2023-02-13_catalog_MMSeq2-50_rep_seq.fasta >",
    " 2023-02-13_catalog_MMSeq2-50_rep_seq_headers.txt"
    )
)
Code
# Read in txt file with 50% clustered catalog headers
cluster_50_headers <- read.delim(
  glue(
    "{protein_catalogs}/clustered_catalogs/merged/",
    "2023-02-13_catalog_MMSeq2-50_rep_seq_headers.txt"
    ),
    stringsAsFactors = F, header = F)

set.seed(42)
cluster_50_headers_downsample <- cluster_50_headers %>%
  as.data.table() %>%
  slice_sample(n = 500000) %>%
  mutate(V1 = gsub(">", "", V1) %>% str_trim(side = "right")) %>%
  pull(V1)

# save downsampled fasta
extract_seq_from_catalog(
  fasta_header_list = cluster_50_headers_downsample,
  output_fasta_path = glue(
    "{protein_catalogs}/clustered_catalogs/merged/",
    "2023-02-13_catalog_MMSeq2-50_500k-subsample.fasta"
  ),
  catalog_path = glue(
    "{protein_catalogs}/clustered_catalogs/merged/",
    "2023-02-13_catalog_MMSeq2-50_rep_seq.fasta"
  )
)

subsamp550k <- protr::readFASTA(
  glue(
    "{protein_catalogs}/clustered_catalogs/merged/",
    "2023-02-13_catalog_MMSeq2-50_500k-subsample.fasta"
  )
)
saveRDS(
  subsamp550k,
  glue("{wkdir}/data/interim/protein_catalog/{Sys.Date()}_custom_catalog_500k.rds")
)

Calculating Molecular Descriptors for Protein Catalog

Calculating peptide lengths and filtering peptides shorter than 7 amino acids

Code
subsamp500k <- readRDS(
  glue(
    "{wkdir}/data/interim/protein_catalog/",
    "2023-04-24_custom_catalog_500k.rds"
  )
)
seq_lengths <- subsamp500k %>% nchar() %>% as.list()
# select proteins that are at least 7 amino acids long
filtered_seq_list <- seq_lengths %>% keep(. > 6)
filtered_subsample_seqs <- subsamp500k %>%
  keep(names(.) %in% names(filtered_seq_list))

set.seed(42)
random_catalog_balanced_headers_n100 <-
  data.frame("fasta_headers" = names(filtered_subsample_seqs)) %>%
  mutate(catalog = strex::str_before_nth(fasta_headers, "_", 2) %>% gsub("CATID_", "", .)) %>%
  group_by(catalog) %>%
  slice_sample(n = 100) %>%
  pull(fasta_headers)
saveRDS(
  random_catalog_balanced_headers_n100,
  glue("{wkdir}/data/interim/protein_catalog/random_catalog_balanced_headers_n100.rds")
)

seq_lengths_df <- as.data.frame(seq_lengths)  %>% 
  pivot_longer(everything(), names_to = "id", values_to = "protein_length") %>% 
  distinct()
saveRDS(
  seq_lengths_df,
  glue("{wkdir}/data/interim/protein_catalog/seq_lengths_df.rds")
)

Visualizing histogram of sequence lengths

Code
seq_lengths_df <- readRDS(
  glue("{wkdir}/data/interim/protein_catalog/seq_lengths_df.rds")
)

p_cat50_seqhist <- seq_lengths_df %>%
  ggplot(aes(x = protein_length)) +
  scale_x_log10() +
  geom_histogram(bins = 250, fill = "lightblue") +
  labs(x = "Protein Length", y = "Count") +
  theme_bw()

ggsave(
  glue("{wkdir}/figures/protein_catalog/sequence-length-hist.png"),
  p_cat50_seqhist,
  width = 6, height = 5, dpi = 300
)

Catalog sequence lengths

Calculating Normalized Moreau-Broto autocorrelation descriptors

Code
protr_descriptors <- tribble(
  ~Accession, ~Molecular_Descriptor,
  "CIDH920105", "Normalized Average Hydrophobicity Scales",
  "BHAR880101", "Average Flexibility Indices",
  "CHAM820101", "Polarizability Parameter",
  "CHAM820102", "Free Energy of Solution in Water, kcal/mole",
  "CHOC760101", "Residue Accessible Surface Area in Tripeptide",
  "BIGC670101", "Residue Volume",
  "CHAM810101", "Steric Parameter",
  "DAYM780201", "Relative Mutability"
)
protr_descriptors %>%
    kable() %>% 
    kable_styling(
      font_size = 15,
      bootstrap_options = c("striped", "hover", "condensed"),
      full_width = F
      )
Accession Molecular_Descriptor
CIDH920105 Normalized Average Hydrophobicity Scales
BHAR880101 Average Flexibility Indices
CHAM820101 Polarizability Parameter
CHAM820102 Free Energy of Solution in Water, kcal/mole
CHOC760101 Residue Accessible Surface Area in Tripeptide
BIGC670101 Residue Volume
CHAM810101 Steric Parameter
DAYM780201 Relative Mutability
Code
library(protr)
possiblyextractMoreauBroto <- possibly(.f = extractMoreauBroto, otherwise = NULL)
possiblyextractDescScales <- possibly(.f = extractDescScales, otherwise = NULL)

random_catalog_balanced_headers_n100 <- readRDS(
  glue("{wkdir}/data/interim/protein_catalog/random_catalog_balanced_headers_n100.rds")
)

future::plan(multisession)
tic()
proteochemical_descriptors <-
  filtered_subsample_seqs[random_catalog_balanced_headers_n100] %>%
  furrr::future_map(~possiblyextractMoreauBroto(., nlag = 7)) %>%
  bind_rows(.id = "id")
toc()
saveRDS(
  proteochemical_descriptors,
  glue("{wkdir}/data/interim/protein_catalog/extractMoreauBroto.rds")
)

future::plan(multisession)
tic()
molPropDescriptors <-
  filtered_subsample_seqs[random_catalog_balanced_headers_n100] %>%
  furrr::future_map(~possiblyextractDescScales(
    ., propmat = "AAMolProp", pc = 3, lag = 7, silent = TRUE)
    ) %>%
  bind_rows(.id = "id")
toc()
saveRDS(
  molPropDescriptors,
  glue("{wkdir}/data/interim/protein_catalog/molPropDescriptors.rds")
)

Visualizing autocorrelation descriptors

Code
MoreauBrotoDescriptors <- readRDS(
  glue("{wkdir}/data/interim/protein_catalog/extractMoreauBroto.rds")
)
pca_df <- MoreauBrotoDescriptors %>%
  distinct() %>%
  mutate(catalog = strex::str_before_nth(id, "_", 2) %>% gsub("CATID_", "", .))

pca_df %>%
  plot_ly(
    x = ~CHOC760101.lag7,
    y = ~CHAM820101.lag7,
    z = ~DAYM780201.lag7,
    text = ~ paste("ID:", id),
    mode = "markers", marker = list(size = 6)
  ) %>% add_markers(
    color = ~ catalog,
    colors = pal_npg(palette = c("nrc"), alpha = 1)(9)
  ) %>%
  layout(scene = list(
    xaxis = list(title = "Residue Accessible Surface Area in Tripeptide"),
    yaxis = list(title = "Polarizability Parameter"),
    zaxis = list(title = "Relative Mutability")
  ))

Visualizing Molecular Property descriptors

Code
molPropDescriptors <- readRDS(
  glue("{wkdir}/data/interim/protein_catalog/molPropDescriptors.rds")
)
pca_df <- molPropDescriptors %>%
  distinct() %>%
  mutate(catalog = strex::str_before_nth(id, "_", 2) %>% gsub("CATID_", "", .))

pca_df %>%
  plot_ly(
    x = ~scl1.lag7,
    y = ~scl2.lag7,
    z = ~scl3.lag7,
    text = ~ paste("ID:", id),
    mode = "markers", marker = list(size = 6)
  ) %>% add_markers(
    color = ~ catalog,
    colors = pal_npg(palette = c("nrc"), alpha = 1)(9)
  ) %>%
  layout(scene = list(
    xaxis = list(title = "Scale 1 - Lag 7"),
    yaxis = list(title = "Scale 2 - Lag 7"),
    zaxis = list(title = "Scale 3 - Lag 7")
  ))

Catalog Comparisons to UniRef90 and MGnify90

Downloading MGnify90 and UniRef90 for comparison

Code
wget_download_slurm(
  jobname = "mgy_clusters_fasta",
  slurm_out = "/central/scratch/jbok/slurmdump/",
  walltime = "4-00:00:00",
  download_link = "http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/mgy_clusters.fa.gz",
  output_dir = "/central/groups/MazmanianLab/joeB/Downloads/RefDBs/MGnify_2023_02/"
)

wget_download_slurm(
  jobname = "uniref90_fasta",
  slurm_out = "/central/scratch/jbok/slurmdump/",
  walltime = "4-00:00:00",
  download_link = "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz",
  output_dir = "/central/groups/MazmanianLab/joeB/Downloads/RefDBs/UniRef/"
)

Wrangling UniRef90 and MGnify90

Code
refdb_dir <- "/central/groups/MazmanianLab/joeB/Downloads/RefDBs"
scratch_dir <- "/central/scratch/jbok/mim_temp/clustering"
shell_do(glue("mkdir -p {scratch_dir}"))
custom_90p_path <- glue(
  "{protein_catalogs}/clustered_catalogs/merged/",
  "2023-02-13_catalog_MMSeq2-90_rep_seq.fasta"
  )

# copy gz fasta files into scratch dir
slurm_shell_do(
  glue("cp {refdb_dir}/MGnify_2023_02/mgy_clusters.fa.gz {scratch_dir}/"), 
  memory = "10G"
  )
slurm_shell_do(
  glue("cp {refdb_dir}/UniRef/uniref90.fasta.gz {scratch_dir}/"), 
  memory = "10G"
  )

# unzip fasta files && rename
slurm_shell_do(glue(
  "cd {scratch_dir} &&",
  " gunzip mgy_clusters.fa.gz &&",
  " mv mgy_clusters.fa MGnify90_custom90_merged.fasta"), 
  memory = "10G"
  )
slurm_shell_do(glue(
  "cd {scratch_dir} &&",
  " gunzip uniref90.fasta.gz &&",
  " mv uniref90.fasta UniRef90_custom90_merged.fasta"), 
  memory = "10G"
  )

# Concatenate CUSTOM 90% Catalog with UniRef90 and MGnify90
slurm_shell_do(
  glue("cat {custom_90p_path} >> {scratch_dir}/UniRef90_custom90_merged.fasta"), 
  memory = "10G"
  )
slurm_shell_do(
  glue("cat {custom_90p_path} >> {scratch_dir}/MGnify90_custom90_merged.fasta"), 
  memory = "10G"
  )

Clustering UniRef90 and MGnify90 + Custom catalog with MMSeq2

Code
# 90% Identity (kmer for seq 80) UniRef90
slurm_shell_do_mmseq2_clust(
  input_fasta = glue("{scratch_dir}/UniRef90_custom90_merged.fasta"),
  output = glue("{scratch_dir}/2023-04-17_MMSeq2-90_UniRef90_custom90"),
  walltime = 270000,
  id_threshold = 0.9,
  memory = "250G",
  ncpus = 4
)

# 90% Identity (kmer for seq 80) MGnify90
slurm_shell_do_mmseq2_clust(
  input_fasta = glue("{scratch_dir}/MGnify90_custom90_merged.fasta"),
  output = glue("{scratch_dir}/2023-04-17_MMSeq2-90_MGnify90_custom90"),
  walltime = 270000,
  id_threshold = 0.9,
  memory = "300G",
  ncpus = 4
)

Results

At 95% similarity, our catalog contains 164,949,326 unique protein sequences. This compendium represents the global human and ruminant gut microbiome genomic repertoire, in addition to common eukaryotic pathogens.

Code
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /central/home/jboktor/miniconda3/envs/pdmbsR/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggpointdensity_0.1.0     viridis_0.6.2            viridisLite_0.4.0       
 [4] ggpackets_0.2.1          ggside_0.2.1             data.table_1.14.2       
 [7] ggsci_2.9                plotly_4.10.1            kableExtra_1.3.4        
[10] protr_1.6-2              progress_1.2.2           listenv_0.8.0           
[13] tictoc_1.1               fs_1.5.2                 furrr_0.3.1             
[16] future.batchtools_0.10.0 future_1.26.1            seqinr_4.2-23           
[19] glue_1.6.2               magrittr_2.0.3           reticulate_1.26         
[22] forcats_0.5.1            stringr_1.4.1            dplyr_1.1.0             
[25] purrr_0.3.5              readr_2.1.2              tidyr_1.2.0             
[28] tibble_3.1.7             ggplot2_3.4.0            tidyverse_1.3.1         

loaded via a namespace (and not attached):
 [1] lubridate_1.8.0   webshot_0.5.4     httr_1.4.3        tools_4.2.0      
 [5] backports_1.4.1   utf8_1.2.2        R6_2.5.1          lazyeval_0.2.2   
 [9] DBI_1.1.2         colorspace_2.0-3  ade4_1.7-19       withr_2.5.0      
[13] gridExtra_2.3     tidyselect_1.2.0  prettyunits_1.1.1 compiler_4.2.0   
[17] cli_3.4.1         rvest_1.0.2       xml2_1.3.3        scales_1.2.0     
[21] checkmate_2.1.0   rappdirs_0.3.3    systemfonts_1.0.4 digest_0.6.30    
[25] rmarkdown_2.18    svglite_2.1.0     pkgconfig_2.0.3   htmltools_0.5.3  
[29] parallelly_1.31.1 highr_0.9         dbplyr_2.1.1      fastmap_1.1.0    
[33] strex_1.4.4       htmlwidgets_1.5.4 rlang_1.1.0       readxl_1.4.0     
[37] rstudioapi_0.13   farver_2.1.0      generics_0.1.2    jsonlite_1.8.3   
[41] crosstalk_1.2.0   Matrix_1.5-3      Rcpp_1.0.8.3      munsell_0.5.0    
[45] fansi_1.0.3       lifecycle_1.0.3   stringi_1.7.8     yaml_2.3.6       
[49] MASS_7.3-57       grid_4.2.0        parallel_4.2.0    crayon_1.5.2     
[53] lattice_0.20-45   haven_2.5.0       hms_1.1.1         batchtools_0.9.16
[57] knitr_1.40        pillar_1.8.1      base64url_1.4     codetools_0.2-18 
[61] reprex_2.0.1      evaluate_0.18     modelr_0.1.8      png_0.1-7        
[65] vctrs_0.6.0       tzdb_0.3.0        cellranger_1.1.0  gtable_0.3.0     
[69] assertthat_0.2.1  xfun_0.34         broom_0.8.0       globals_0.15.0   
[73] ellipsis_0.3.2    brew_1.0-8       

References

Almeida, Alexandre, Stephen Nayfach, Miguel Boland, Francesco Strozzi, Martin Beracochea, Zhou Jason Shi, Katherine S. Pollard, et al. 2021. “A Unified Catalog of 204,938 Reference Genomes from the Human Gut Microbiome.” Nature Biotechnology 39 (1): 105–14. https://doi.org/10.1038/s41587-020-0603-3.
Aurrecoechea, Cristina, Ana Barreto, John Brestelli, Brian P. Brunk, Shon Cade, Ryan Doherty, Steve Fischer, et al. 2013. “EuPathDB: The Eukaryotic Pathogen Database.” Nucleic Acids Research 41 (D1): D684–91. https://doi.org/10.1093/nar/gks1113.
Harris, Todd W, Valerio Arnaboldi, Scott Cain, Juancarlos Chan, Wen J Chen, Jaehyoung Cho, Paul Davis, et al. 2020. “WormBase: A Modern Model Organism Information Resource.” Nucleic Acids Research 48 (D1): D762–67. https://doi.org/10.1093/nar/gkz920.
Merrill, Bryan D., Matthew M. Carter, Matthew R. Olm, Dylan Dahan, Surya Tripathi, Sean P. Spencer, Brian Yu, et al. n.d. “Ultra-Deep Sequencing of Hadza Hunter-Gatherers Recovers Vanishing Gut Microbes.” https://doi.org/10.1101/2022.03.30.486478.
Nayfach, Stephen, David Páez-Espino, Lee Call, Soo Jen Low, Hila Sberro, Natalia N. Ivanova, Amy D. Proal, et al. 2021. “Metagenomic Compendium of 189,680 DNA Viruses from the Human Gut Microbiome.” Nature Microbiology 6 (7): 960–70. https://doi.org/10.1038/s41564-021-00928-6.
Stewart, Robert D., Marc D. Auffret, Amanda Warr, Alan W. Walker, Rainer Roehe, and Mick Watson. 2019. “Compendium of 4,941 Rumen Metagenome-Assembled Genomes for Rumen Microbiome Biology and Enzyme Discovery.” Nature Biotechnology 37 (8): 953–61. https://doi.org/10.1038/s41587-019-0202-3.
Wilkinson, Toby, Daniel Korir, Moses Ogugo, Robert D. Stewart, Mick Watson, Edith Paxton, John Goopy, and Christelle Robert. 2020. “1200 High-Quality Metagenome-Assembled Genomes from the Rumen of African Cattle and Their Relevance in the Context of Sub-Optimal Feeding.” Genome Biology 21 (1): 229. https://doi.org/10.1186/s13059-020-02144-7.
Xie, Fei, Wei Jin, Huazhe Si, Yuan Yuan, Ye Tao, Junhua Liu, Xiaoxu Wang, et al. 2021. “An Integrated Gene Catalog and over 10,000 Metagenome-Assembled Genomes from the Gastrointestinal Microbiome of Ruminants.” Microbiome 9 (1): 137. https://doi.org/10.1186/s40168-021-01078-x.
Zeng, Shuqin, Dhrati Patangia, Alexandre Almeida, Zhemin Zhou, Dezhi Mu, R. Paul Ross, Catherine Stanton, and Shaopu Wang. 2022. “A Compendium of 32,277 Metagenome-Assembled Genomes and over 80 Million Genes from the Early-Life Human Gut Microbiome.” Nature Communications 13 (1): 5139. https://doi.org/10.1038/s41467-022-32805-z.